Gesture Kinematic Spaces with Dynamic Time Warping (shapeDTW)

contact: wim.pouw@donders.ru.nl

Info documents¶

  • location Repository: https://github.com/WimPouw/envisionBOX_modulesWP/tree/main/gesture_kinematic

  • location Jupyter notebook: https://github.com/WimPouw/envisionBOX_modulesWP/blob/main/MultimodalMerging/gesture_kinematic_spaces.ipynb

Citation¶

Pouw, W. (2024). Wim Pouw's EnvisionBOX modules for social signal processing (Version 1.0.0) [Computer software]. https://github.com/WimPouw/envisionBOX_modulesWP

Background¶

Human (communicative) body movements are often recurrent, such that they use similar shapes or trajectories to convey or act on something consistently. How can we make sense of the variability and consistency with which movement events are produced? How can determine that two gestures are more or less similar in form? One way to do this is by quantifying the similarities present between all the produced movement events. Once we have documented how similar each movement event is relative to all other events, we can then visualize and quantify the structure of these interrelationships and discover novel patterns that exist over the entire set of behaviors. Furthermore, we can make an informed decision of what gestures are highly similar to each other relative to others. In previous work we have found that when communicative silent gestures have been practiced and communicated more, those silent gestures start to change their interrelationships in consistent ways (e.g., the kinematic space reduces in complexity) (see [3] Pouw, Dingemanse, Motamedi, Ozyurek, 2021).

Note that this particular procedure outlined here, is introduced in [1] Pouw & Dixon (2020) and we advise you to read this first before starting (or [3]). The main procedure: we compute some distance measure between two N-dimenionsal time series i and j, and do this for all events n. Similar to previous work [1] we will use dynamic time warping (see [2] Giorgino, 2009) as the distance calculation. We then compute what is called a distance matrix (M) which has n rows and n columns containing for each cell M[i,j] the distance score between movement i and j. Then we can use dimensional reduction techniques (e.g., classic multidimenionsal scaling, t-sne, UMAP, PCA) to visualize in 2-dimensional space the interrelationships between all movement events.

For the current movement events we use a set of silent gestures produced by an actor conveying certain concepts (open dataset by: [4][Ortega & Ozyurek, 2020). We extracted the movement information using mediapipe 3D body tracking for each silent gesture, and this is the input of our procedure. Note that the current procedure can be applied on any signal (e.g., acoustic trajectories) and also on multiple persons that are producing movement events.

Overview processing steps¶

This markdown file contains the following processing steps.

  • Create a distance matrix with dynamic time warping distances between events

  • Perform dimensionality reduction to get a 2D space representing the interrelationships (which will be used to build an app as shown here.

  • Create a gesture network, and extract some features to relate to the metadata of Ortega et al.

Starting up¶

Lets load some general packages, identify the right folders and show some of the data we are working with.

In [1]:
# loading in some basic packages
import numpy as np                           # basic data operations
import pandas as pd                          # data wrangling
import matplotlib.pyplot as plt              # for plotting
import plotly.graph_objects as go            # for plotting
import html                                  # for html-related add-ons
import os                                    # for foldering                              
from IPython.display import Image            # for showing videos
from scipy.ndimage import gaussian_filter1d  # for smoothing

# initializing some folders and files
timeseriesfolder = "../TimeSeries/"
distancematrixfolder = "../Distance_matrices/"
videofolder = "../Videos/WithMotionTracking/"

The folder 'TimeSeries' is filled with mediapipe body tracking time series for each video. We extract from it, nose, elbow, index finger (x , y, z) info (for both hands). Figure 1 shows an example of the kinematic time series for the silent gesture "TO-SWIM" which is also shown below. The swimming gesture has a characteristic rhythmic movement, as reflected in the oscillating movements in the horizontal dimension for left and right index finger.

Figure 1. Example raw time series of SWIMMING for the right (red) and left hand (in blue)

In [4]:
#load in data
gesture = 'TO-SWIM'
MT = pd.read_csv(timeseriesfolder+gesture+"_silentgesture.csv")
MT = MT.apply(lambda x: gaussian_filter1d(x, sigma = 2)) #smooth a little
#plot X_RIGHT_INDEX and Y_RIGHT_INDEX time series with time as x-axis
plt.plot(MT['time'], MT['X_RIGHT_INDEX'], label='X_RIGHT_INDEX')
plt.plot(MT['time'], MT['Y_RIGHT_INDEX'], label='Y_RIGHT_INDEX')
plt.plot(MT['time'], MT['Z_RIGHT_INDEX'], label='Z_RIGHT_INDEX')
plt.xlabel('time')
plt.ylabel('position')
plt.legend()
plt.show()

# show gif
Image(url="./images/"+gesture+".gif")
Out[4]:

DTW¶

DTW basics¶

Dynamic time warping is a method to compare and match two sequences (often time series). It tries to match sequences by allowing for some non-linear distortions in time (or another dimension), so that sequences maximally line up, so as to determine their similarity (after some distortion). This distortation has certain limits and constraints (e.g., such that values matched are not cris-crossing in time) and the degree of time distortions are effectively visualized by a warping path shown below. After matching values in one sequence to another value in the other sequence, one can compute the difference (e.g., euclidean distance between the values) that is still left between these warped sequences. This difference, error, or hereonafter "distance", provides a measure of dissimilarity between two time series. Note dynamic time warping can also compute distances between time series that have more than 1 dimension (as exemplified below). The dynamic time warping distances are best expressed after normalization when time series are known to have unequal lengths (otherwise, distances would add to bigger numbers for longer time series, than shorter one's, though they might be equally similar). The normalization consist in dividing the distance by the length of the timeseries.

Shape DTW¶

There are now various variants of DTW. We are using "shape dtw" for this example, which was developed to overcome certain issues with 'classic DTW' having to do with matching segments between timeseries that should not be matched as they are actually quite dissimilar. The inventors of shapeDTW state "Shape-DTW is inherently a DTW algorithm,but additionally attempts to pair locally similar structures and to avoid matching points with distinct neighborhoodstructures", p. 1, for the original paper see https://arxiv.org/pdf/1606.01601.pdf. For a nice introduction with how to set up your DTW pipeline, see here https://mikolajszafraniecupds.github.io/shapedtw-python/. Shape dtw operates on subsequences of the time series so that it can take into account not only a single point that needs to be matched with the another point in the sequence, but rather take into account the context of the neighboring points too. It therefore favours point that have a similar surrounding values, rather than just single points that have matching values. We set the width of the context at 20 (i.e., 20 timesteps, context of about 800ms assuming a framerate of 25fps), as this seems to provide good matches.

Other things to consider¶

There are many other different things that you need to consider when doing dynamic time warping. For example, you if you care about the general shape of a gesture but not its size, then you might want to normalize (e.g., z-scale) the time series first. Or you might not care about the place in gesture space, so you center the the position coordinates first. Also consider, we now only take a few keypoints that might not capture enough of the hand shape or other features, so you can consider adding more keypoints (a the cost of computational time).

Further, what if you have a gesture consisting of multiple elements, and you want to judge another gesture as similar when it has the same elements regardless of the order of the items? Then you might not want to use dynamic time warping as it will judge a time series with elements a-b-c to be completely different from a time series with elements c-b-a (a possible time series distance measure that does judge such time series to be similar is matrix profile distance or MPdist).

There are also a range of settings that optimize dynamic time warping performance. For example, you may want to limit the distance error calculation to be based on the begin and trailing portions of the time series as these often are difficult to match, which can be a pesky issue in classic-dtw discussed in [4] (Silva et al 2016; a possible solution for this is psy-dtw). Or it might happen that the warping leads to extreme distortions in time leading to non-sensical matches (a solution for this is limiting the maximum warping constraint, using for example a Sakoe-Chiba window). There are many such optimizations and it would be good to familiarize oneself with these as they can greatly improve performance.

Further study¶

This module barely scratches the surface of what is behind dynamic time warping. A nice more in-depth covering of DTW is overviewed here https://www.audiolabs-erlangen.de/resources/MIR/FMP/C3/C3S2_DTWbasic.html, with an open access book chapter https://www.audiolabs-erlangen.de/resources/MIR/FMP/C3/C3S2_DTWbasic.html. Then we would recommend digging further into the specific implimentation used here, which is shape-dtw: https://mikolajszafraniecupds.github.io/shapedtw-python/.

In [6]:
from shapedtw.shapedtw import shape_dtw
from shapedtw.dtwPlot import dtwPlot
from shapedtw.shapeDescriptors import RawSubsequenceDescriptor

# using dtw to compare distances and show a warping line
MT1 = pd.read_csv(timeseriesfolder+"TO-SWIM_silentgesture.csv")
MT2 = pd.read_csv(timeseriesfolder+"BUTTERFLY_silentgesture.csv") #compare example against BANANA
MT1 = MT1.apply(lambda x: gaussian_filter1d(x, sigma = 2)) #smooth a little
MT2 = MT2.apply(lambda x: gaussian_filter1d(x, sigma = 2)) #smooth a little

# select some set of features from the motion tracking dataframe
keypointindexleft = ["X_LEFT_INDEX", "Y_LEFT_INDEX", "Z_LEFT_INDEX"]
ts1 = np.array(MT1[keypointindexleft])
ts2 = np.array(MT2[keypointindexleft])

# setting up the DTW functions
shape_dtw_dependent_results = shape_dtw(
    x=ts1,
    y=ts2,
    subsequence_width=20, # so this means for the matching that we take 20 indices into account
    shape_descriptor=RawSubsequenceDescriptor(), # the local context that is taken to compare to other local context is just the raw values
    multivariate_version="dependent" 
)
dependent_distance = round(shape_dtw_dependent_results.distance, 2)
dependent_distance_normalized = shape_dtw_dependent_results.normalized_distance
dtwPlot(shape_dtw_dependent_results, plot_type="twoway", xoffset=0.1)
dtwPlot(shape_dtw_dependent_results, plot_type="threeway")

print("DTW distance is: " + str(dependent_distance))
print("Normalized DTW distance is: " + str(dependent_distance_normalized))
DTW distance is: 32.94
Normalized DTW distance is: 0.17338773083097164

Main routine: constructing a distance matrix¶

The next step is to compute a average dynamic time warping distance between joint position traces. This means we are performing a dependent DTW (using fast-dtw), where x,y,z time series are submitted for DTW comparison for a gesture i and gesture event j, and we perform this procedure for each joint separately (this averaging of by-keypoint DTW comparisons is called an independent DTW; so we are combining an independent and dependent DTW version).

In [26]:
#identify which column indices in a dataframe correspond to which body parts
keypointindexright =  ["X_RIGHT_INDEX", "Y_RIGHT_INDEX", "Z_RIGHT_INDEX"]
keypointindexleft = ["X_LEFT_INDEX", "Y_LEFT_INDEX", "Z_LEFT_INDEX"]
keypointelbowright = ["X_RIGHT_ELBOW", "Y_RIGHT_ELBOW", "Z_RIGHT_ELBOW"]
keypointelbowleft = ["X_LEFT_ELBOW", "Y_LEFT_ELBOW", "Z_LEFT_ELBOW"]
keypointshoulderright = ["X_RIGHT_SHOULDER", "Y_RIGHT_SHOULDER", "Z_RIGHT_SHOULDER"]
keypointshoulderleft = ["X_LEFT_SHOULDER", "Y_LEFT_SHOULDER", "Z_LEFT_SHOULDER"]

# make one 1 list with all keypoints
keypoints = keypointindexright + keypointindexleft + keypointelbowright + keypointelbowleft + keypointshoulderright + keypointshoulderleft

#make a function that takes in two timeseries and produces a normalized dtw distance
def dtw_distance(ts1, ts2):
    ts1 = np.array(ts1)
    ts2 = np.array(ts2)
    res = shape_dtw(x = ts1, y = ts2, subsequence_width=20,shape_descriptor=RawSubsequenceDescriptor(), multivariate_version="dependent")
    distance = res.normalized_distance
    return distance

# make a dependent dtw such that each keypoint dtw distance is added up
def dtw_distance_dependent(MT1, MT2):
    disleftindex = dtw_distance(MT1[keypointindexleft], MT2[keypointindexleft])
    disrightindex = dtw_distance(MT1[keypointindexright], MT2[keypointindexright])
    disleftelbow = dtw_distance(MT1[keypointelbowleft], MT2[keypointelbowleft])
    disrightelbow = dtw_distance(MT1[keypointelbowright], MT2[keypointelbowright])
    disleftshoulder = dtw_distance(MT1[keypointshoulderleft], MT2[keypointshoulderleft])
    disrightshoulder = dtw_distance(MT1[keypointshoulderright], MT2[keypointshoulderright])
    #also compute a distance between hands
    disbetweenhands = dtw_distance(np.array((MT1[keypointindexleft])-np.array(MT1[keypointindexright])), (np.array(MT2[keypointindexleft])-np.array(MT2[keypointindexright])))
    # return the mean distance
    return (disleftindex + disrightindex + disleftelbow + disrightelbow + disleftshoulder + disrightshoulder) / 7

Main routine: constructing a distance matrix¶

The next step is to compute a average dynamic time warping distance between joint position traces. This means we are performing a dependent DTW, where x,y,z time series are submitted for dependent DTW comparison for a gesture i and gesture event j, and we perform this procedure for each joint separately (this averaging of by-keypoint DTW comparisons is called an independent DTW; so we are combining an independent and dependent DTW version).

In [27]:
# First identify the number of different time series csv files we have in the data folder timeseriesfolder
ts_objloc = os.listdir(timeseriesfolder)
num_ges = len(ts_objloc)
print("Number of gesture files: ", num_ges)

# Load all gesture files into a dictionary
gesture_data = {}
gesture_names = []
for i, objloc in enumerate(ts_objloc):
    ges_i = pd.read_csv(timeseriesfolder + objloc, sep=',')
    #lets filter the time series
    ges_i = ges_i.apply(lambda x: gaussian_filter1d(x, sigma = 2))
    gesture_data[i] = ges_i[keypoints] #only keep the relevant columns
    # what gesture is this
    name = objloc.replace("silent_gesture.csv", "")
    gesture_names.append(name)

print("Loaded", len(gesture_data), "gesture files and smoothed them")
Number of gesture files:  109
Loaded 109 gesture files and smoothed them

Note that this takes some time (about 6 minutes) to compute in the below implementation (you could re-implement this using a parrelel computing procedure; see here for a fast implementation in R https://wimpouw.github.io/EnvisionBootcamp2021/gesturenetworks_module1.html). But this code is much better to comprehend, so we will leave that optimization for whomever wants to tackle that.

In [29]:
from tqdm import tqdm  # Import the tqdm library for the progress bar

# Initialize a matrix to store for each gesture comparison the dtw distance
dtw_dist = np.zeros((num_ges, num_ges))

# Loop over all gesture files
for i in tqdm(range(num_ges), desc="Processing gestures", unit="gesture"):
    # Read the i-th gesture file
    ges_i = gesture_data[i]

    # Loop over all gesture files
    for j in range(num_ges):
        # Check if dtw_dist already has a value for the i-th and j-th gesture
        if dtw_dist[i, j] == 0 & (i != j):
            # Read the j-th gesture file
            ges_j = gesture_data[j]

            # Compute the dtw distance between the i-th and j-th gesture
            distance = dtw_distance_dependent(ges_i, ges_j)
            dtw_dist[i, j] = distance

# After the computation is finished, dtw_dist will be populated with all the distances
print(dtw_dist)

# save as dtwdist as csv
np.savetxt(distancematrixfolder+'distance_matrix_python.csv', dtw_dist, delimiter=',')
Processing gestures: 100%|██████████████████████████████████████████████████████| 109/109 [06:48<00:00,  3.75s/gesture]
[[0.         0.1020262  0.12202012 ... 0.10876124 0.08164879 0.06878361]
 [0.1020262  0.         0.10286257 ... 0.10962695 0.08387376 0.08681442]
 [0.12202012 0.10286257 0.         ... 0.11613938 0.10156234 0.08568243]
 ...
 [0.10876124 0.10962695 0.11613938 ... 0.         0.12022866 0.05182213]
 [0.08164879 0.08387376 0.10156234 ... 0.12022866 0.         0.08457529]
 [0.06878361 0.08681442 0.08568243 ... 0.05182213 0.08457529 0.        ]]

Dimensionity reduction to get a "kinematic space"¶

The next thing we will do is further process the distance matrix containing all gesture comparisons so that we can easily visualize it. We will use a dimensionality reduction technique called umap! This will make a 2-d representation of the high dimensional data that we have (each gesture is now defined by all distances to all other gestures, but we would like to define it on a 2D plane). This 2D plane can help us see close gesture neighbors and provide some information about clustering for example.

In [41]:
#from sklearn.manifold import MDS
import umap.umap_ as umap #pip install umap-learn

# load in the metadata that we will collect some info from
metadata = pd.read_csv("../IconicityRatings/Iconicity_ratings.csv", sep=',')
# also change the English column so that that spaces become dashes
metadata['English'] = metadata['English'].str.replace(' ', '-')

# Read the distance matrix from the CSV file
dtw_dist = np.genfromtxt(distancematrixfolder+'distance_matrix_python.csv', delimiter=',')

# Assuming your distance matrix is stored in 'distance_matrix'
# Adjust 'n_components' based on how many dimensions you want to project to (in this case, 2D)
# higher value of n_neighbors will lead to more global structure (less extreme clustering)
n_components = 2
umap_model = umap.UMAP(n_components=n_components, n_neighbors=15)
pos = umap_model.fit_transform(dtw_dist)

for i, obj in enumerate(gesture_names):
    #get x, y data
    posix = pos[i][0]
    posiy = pos[i][1]
    # remove from obj a substring 'silent_gesture.csv'
    obj = obj.replace('_silentgesture.csv', '')
    #check if the English column matches the obj
    if metadata['English'].str.contains(obj).any():
        # if it does, then add the posix and posiy values to the data frame
        metadata.loc[metadata['English'] == obj, 'x'] = posix
        metadata.loc[metadata['English'] == obj, 'y'] = posiy
    if not metadata['English'].str.contains(obj).any():
        print('we could not match the following gesture to the metadata: ', obj)

# Save the metadata pd frame for the dashboard
metadata.to_csv('./Dashboard/Assets/main.csv', index=False)

# we have now added x, y data
metadata.head()
we could not match the following gesture to the metadata:  TO-CUT_knife
we could not match the following gesture to the metadata:  TO-CUT_scissors
we could not match the following gesture to the metadata:  TO-JUGGLE_
Out[41]:
English Dutch No. participants Mode of rep Semantical category pp01 pp02 pp03 pp04 pp05 ... pp12 pp13 pp14 pp15 pp16 pp17 pp18 MEAN x y
0 TO-FLY VLIEGEN 20 personification Action without object 2 1 4 6 6 ... 5 5 6 2 5 5 6 4.94 22.006117 0.477968
1 TO-SWIM ZWEMMEN 19 acting Action without object 7 6 7 7 7 ... 7 7 7 7 6 7 7 6.89 21.827307 -0.400335
2 HOUSE HUIS 17 drawing Non-manipulable object 5 2 3 6 5 ... 4 6 5 3 3 3 3 4.67 24.404438 -1.729477
3 TO-WRING WRINGEN 20 acting Action with object 6 5 3 7 4 ... 7 6 5 5 4 6 5 5.67 24.793688 -2.042675
4 TO-SHOUT SCHREEUWEN 11 acting Action without object 7 6 4 6 7 ... 6 7 7 5 7 6 5 6.28 23.840233 -0.497821

5 rows × 26 columns

In [42]:
import numpy as np
import networkx as nx
import matplotlib.pyplot as plt
from skimage.filters import threshold_otsu # we use this for binarizing a matrix

# Read the distance matrix from the CSV file
dtw_dist = np.genfromtxt(distancematrixfolder + 'distance_matrix_python.csv', delimiter=',')

# Calculate Otsu's threshold
    # this algorithm allows you to determine a boundary based on the distribution of the data
threshold_value = threshold_otsu(dtw_dist)

# Create a binary matrix using Otsu's threshold
binary_matrix = dtw_dist <= threshold_value

# Create a graph from the binary matrix
G = nx.Graph(binary_matrix)
G.remove_edges_from(nx.selfloop_edges(G)) #adjust some of the network properties (no self-loops)

# Draw the network
pos = nx.spring_layout(G)  # Position nodes using a spring layout algorithm
nx.draw(G, pos, with_labels=True, node_size=50, font_size=4, font_color='black')

plt.title("Undirected Network")
plt.show()

Further analyses¶

From gesture networks you can further extract information about each gesture's connectivity with others. For example, we can cound the number of connections. If a gesture is connected to many other gestures it might mean that such gestures are not very disiguishable from others. So lets explore a simple conjecture: if a gesture has little connections with other, it is more distinguishable and therefore these gestures were more likely to be rated as more iconic in their meaning. We have this data from the Ortega datasat, so lets have a peak! The results seem to prove us wrong.

In [46]:
#load metadata
metadata = pd.read_csv("../IconicityRatings/Iconicity_ratings.csv", sep=',')
# compute from Graph object the degree of each 
metadata['degrees'] = list(dict(G.degree()).values())

# Create a scatter plot
plt.scatter(metadata['MEAN'], metadata['degrees'])
plt.xlabel('MEAN iconicity rating')
plt.ylabel('Degrees')
plt.title('Scatter Plot of MEAN vs. Degrees')
plt.grid(True)
plt.show()

# Calculate the correlation coefficient
correlation = np.corrcoef(metadata['MEAN'], metadata['degrees'])[0, 1]

print(f"Correlation coefficient: {correlation:.2f}")
Correlation coefficient: -0.02
  1. Pouw, W. , & Dixon, J. A. (2020). Gesture networks: Introducing dynamic time warping and network analysis for the kinematic study of gesture ensembles. Discourse Processes, 57(4), 301-319. doi: 10.1080/0163853X.2019.1678967

  2. Pouw, W. , Dingemanse, M., Motamedi, Y., Özyürek, A. (2021). A systematic investigation of gesture kinematics in evolving manual languages in the lab. Cognitive Science. doi (open access): 10.1111/cogs.13014

  3. Ortega, G., & Özyürek, A. (2020). Systematic mappings between semantic categories and types of iconic representations in the manual modality: A normed database of silent gesture. Behavior Research Methods, 52(1), 51-67.

  4. Silva, D. F., Batista, G. A. E. P. A., & Keogh, E. (2016). On the effect of endpoints on dynamic time warping. SIGKDD Workshop on Mining and Learning from Time Series, II. San Francisco, USA.